Skip to content

[ENH] V1 → V2 API Migration - datasets#1608

Open
JATAYU000 wants to merge 398 commits into
openml:mainfrom
JATAYU000:dataset_resource
Open

[ENH] V1 → V2 API Migration - datasets#1608
JATAYU000 wants to merge 398 commits into
openml:mainfrom
JATAYU000:dataset_resource

Conversation

@JATAYU000

@JATAYU000 JATAYU000 commented Jan 8, 2026

Copy link
Copy Markdown
Contributor

Metadata

@codecov-commenter

codecov-commenter commented Jan 8, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 19.44012% with 518 lines in your changes missing coverage. Please review.
✅ Project coverage is 27.64%. Comparing base (fabbc28) to head (f6dc7bd).

Files with missing lines Patch % Lines
openml/_api/resources/dataset.py 19.00% 405 Missing ⚠️
openml/_api/clients/minio.py 20.00% 48 Missing ⚠️
openml/datasets/dataset.py 11.11% 40 Missing ⚠️
openml/datasets/functions.py 11.11% 16 Missing ⚠️
openml/utils/_openml.py 36.36% 7 Missing ⚠️
openml/_config.py 33.33% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1608       +/-   ##
===========================================
- Coverage   81.50%   27.64%   -53.87%     
===========================================
  Files          63       63               
  Lines        5240     5593      +353     
===========================================
- Hits         4271     1546     -2725     
- Misses        969     4047     +3078     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@JATAYU000

JATAYU000 commented Jan 9, 2026

Copy link
Copy Markdown
Contributor Author

FYI @geetu040 Currently the get_dataset() function has 3 download requirement

  • download_data : uses api_calls._download_minio_bucket() to download all the files in the bucket if download_all_files param was True and api_calls._download_minio_file() to download the dataset.pq file if it was not found in cache. When download parquet fails it fallback to download dataset.arff file with get request
  • download_features : if feature_file is passed via init it extracts during initialization else does get request and caches the xml
  • download_qualities : if qulities_file is passed via init it extracts during initialization else does get request and caches the xml

Issues:

  • The data files .pq and .arff are common for versions and doesn't make sense to be downloaded multiple times
  • Path handling for download to return the path especially the data files, As mentioned in the meet I can try the Download specific class which uses the cache mixin and only inherited by dataset resource.
  • Current implementation in OpenMLDataset has v1 specific parsing which in my opinion should be using the current interface (api_context)

Example:

current load_features() ref link
This calls a function which downloads and returns a file path and then parse from the file path
This can be changed by changing that function's definition ref link to get -> parse -> return features instead of file paths

def _get_dataset_features_file(did_cache_dir: str | Path | None, dataset_id: int) -> dict[int, OpenMLDataFeature]:
        return _features

Or by updating the Dataset class to use the underlining interface method from api_context directly.

def _load_features(self) -> None:
       ...
        self._features = api_context.backend.datasets.get_features(self.dataset_id)

Another option is to add return_path to client requests, which in my opinion would be wasteful since adding a param to all the methods of client for just the dataset resource, and that too which could be handled without it as mentioned above.

@geetu040 geetu040 mentioned this pull request Jan 9, 2026
18 tasks

@geetu040 geetu040 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left an intermediate review. This is solid work and well done overall. Nice job. I'll look into the download part now.

Comment thread openml/_api/resources/base/resources.py
Comment thread openml/_api/resources/dataset.py
Comment thread openml/_api/resources/datasets.py Outdated
bool
True if the deletion was successful. False otherwise.
"""
return openml.utils._delete_entity("data", dataset_id)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you implement the delete logic yourself instead of openml.utils._delete_entity, how would that look? I think it would be better.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes Sense , It would look like a delete request from client along with exception handling

Comment on lines +456 to +461
def list(
self,
limit: int,
offset: int,
**kwargs: Any,
) -> pd.DataFrame:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, it can use private helper methods

Comment thread openml/datasets/functions.py Outdated
# Minimalistic check if the XML is useful
if "oml:data_qualities_list" not in qualities:
raise ValueError('Error in return XML, does not contain "oml:data_qualities_list"')
from openml._api import api_context

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we have this import at the very top? does it create circular import error? if not, should be moved to top from all functions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does raise circular import

@geetu040

Copy link
Copy Markdown
Collaborator

FYI @geetu040 Currently the get_dataset() function has 3 download requirement

Thanks for a detailed explanation, I now have good understanding of the download mechanism.

download_data : uses api_calls._download_minio_bucket() to download all the files in the bucket if download_all_files param was True and api_calls._download_minio_file() to download the dataset.pq file if it was not found in cache. When download parquet fails it fallback to download dataset.arff file with get request

minio can be handled easily, we will use a separate client along with HTTPClient or implement it's methods in the HTTPClient, which work independently of the api version

  • download_features : if feature_file is passed via init it extracts during initialization else does get request and caches the xml

  • download_qualities : if qulities_file is passed via init it extracts during initialization else does get request and caches the xml

these are actually different objects in both apis, v1 uses xml and v2 keeps them in json

The data files .pq and .arff are common for versions and doesn't make sense to be downloaded multiple times

yes you are right, they are the same files, which are not required to be downloaded again for both versions, but isn't this true for almost all the http objects? they may have different format xml or json, slightly different structure, but if parsed most are identical, so shouldn't this rule be applied to all the responses?

Path handling for download to return the path especially the data files, As mentioned in the meet I can try the Download specific class which uses the cache mixin and only inherited by dataset resource.

I don't understand this point

Current implementation in OpenMLDataset has v1 specific parsing which in my opinion should be using the current interface (api_context)

agreed, should be handled by api_context

Another option is to add return_path to client requests, which in my opinion would be wasteful since adding a param to all the methods of client for just the dataset resource, and that too which could be handled without it as mentioned above.

agreed, adding return_path for just one specific method of one resource is not preffered


in conclusion, I may ask, if we ignore the fact that it downloads the .arff files for both versions separately, does everything else works out smooth without any blocker? I think ignoring this part is not really bad because conceptually this rule could be applied to almost every other response object

@JATAYU000

JATAYU000 commented Jan 15, 2026

Copy link
Copy Markdown
Contributor Author

@geetu040 making a new client for minio handles just the parquet file, we would still need to migrate download_text_file() for the arff file (this is also used by tasks and runs)
So maybe we can have a DownloadClient which can contain all of these along with a save method which can save content to a specified path and hence also fixes our issue with features/qualities path ?

FYI the new commit adds better handling of feature and qualites in OpenMLDataset class moving the v1 specific parsing logic to the interface. So only part left is to handle

  • return path of saved file (feature, qualities, arff, pq)
  • downloader for arff or implementation of download_text_file() which is used for arff download
  • minio file and bucket download for the pq file

@geetu040

Copy link
Copy Markdown
Collaborator

From the standup discussion and earlier conversations, I think we can agree on a few points:

  • Have a separate client for MinIO interactions alongside HTTPClient. In future if we plan to add more providers like dropbox, google drive, e.t.c, we don't end up with too many changes, rather have their client implemented as a separate class and just pass that down to relevant resource.
  • DownloadClient doesn't feel like the right abstraction; instead, implement download-specific methods directly in HTTPClient.

Consider this a green light to experiment with the client design. Try an approach, use whatever caching strategy you think fits best, and aim for a clean, sensible design. Feel free to ask for suggestions or reviews along the way. I'll review it in code. Just make sure this doesn't significantly impact the base classes or other stacked PRs.

@JATAYU000

Copy link
Copy Markdown
Contributor Author

The points do make sense to me, I will propose the design along with how it would be used in the resource.

@JATAYU000

Copy link
Copy Markdown
Contributor Author

@geetu040 I have a design implemented which needs reviews

MinIOClient similar to HTTPClient is being used by DatasetAPI from self._minio , It implements 2 methods download file and download bucket, it uses _get_cache_dir() for the destination
New method download implemented under HTTPclient that can be used for features,qualities and arff files, along with specific v1/v2 interface using handler callback.

Question:

  • most methods signature include cache_directory how should that be handled? if the directory is passed use that if not use our cache dir? i am not sure how this would effect the old users
  • Also the caching implemented currently suggest the Response() is cached but I remember from a meeting you mentioned the respective files (.xml .json) are cached, I am not sure about it , I have went through the design as if the caching is done on the response.

@geetu040

Copy link
Copy Markdown
Collaborator

@geetu040 I have a design implemented which needs reviews

I have taken a quick look, the design looks really good, though I have some suggestions/questions in the code, which I will leave in a detailed review. But this in general fixes all our blockers without damaging the original design.

most methods signature include cache_directory how should that be handled? if the directory is passed use that if not use our cache dir? i am not sure how this would effect the old users

Is it provided by the user? I don't think so. In that case, how does it affect the users? From looking at the code, this cache directory is generated programmatically inside the functions, we can completely remove these lines and always rely on the CacheMixin class. How does that sound?

Also the caching implemented currently suggest the Response() is cached but I remember from a meeting you mentioned the respective files (.xml .json) are cached, I am not sure about it , I have went through the design as if the caching is done on the response.

CacheMixin._set_cache_response will look at the response object and extract json or xml content from it and save it respectively in .json and .xml files.
CacheMixin._get_cache_response will read these files (.json or .xml) from the given path and create a dummy Response object then fill it with status_code and content. Therefore a Response object will be returned.

@JATAYU000

JATAYU000 commented Jan 19, 2026

Copy link
Copy Markdown
Contributor Author

CacheMixin._set_cache_response will look at the response object and extract json or xml content from it and save it respectively in .json and .xml files.
CacheMixin._get_cache_response will read these files (.json or .xml) from the given path and create a dummy Response object then fill it with status_code and content. Therefore a Response object will be returned.

This makes a sense now. having an independent download method as I have setup is better than updating requests/caching to return path right?

we can completely remove these lines and always rely on the CacheMixin class. How does that sound?

Yes that would work, but the function definition would be changed i.e. tests etc corresponding to them

@geetu040

Copy link
Copy Markdown
Collaborator

This makes a sense now. having an independent download method as I have setup is better than updating requests/caching to return path right?

I am not sure about that, would require a detailed review

Yes that would work, but the function definition would be changed i.e. tests etc corresponding to them

yes, that is expected

Comment thread openml/_api/resources/dataset.py Outdated
Comment thread openml/_api/http/client.py Outdated
parsed_url = urllib.parse.urlparse(source)

# expect path format: /BUCKET/path/to/file.ext
_, bucket, *prefixes, _file = parsed_url.path.split("/")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_file should be renamed to _ given it is never called, surprised ruff does not call it out.

Comment thread openml/datasets/functions.py Outdated
Parameters
----------
data_id : list, optional
dataset_id : list, optional

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be data_id

Comment thread openml/datasets/functions.py Outdated
def _get_dataset_parquet(
description: dict | OpenMLDataset,
cache_directory: Path | None = None,
cache_directory: Path | None = None, # noqa: ARG001

@satvshr satvshr Jan 20, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the ARG001 addition, I assume ruff checks were passing before this was added too?

Edit1: Just saw that this is not being used, why not remove it then?

Edit2: This is happening in more than 1 place, I won't mention it in all of them so we can just discuss it here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still under work,This method is now updated to not use the param, working on removing it here and update the corresponding test when starting to write tests for this pr, as mentioned in the meet

Comment thread openml/datasets/functions.py
Comment thread openml/datasets/functions.py
Comment thread openml/datasets/functions.py

@satvshr satvshr left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments, will look at this again once it is actually ready for review with all implementations complete.

@satvshr

satvshr commented Jan 20, 2026

Copy link
Copy Markdown
Contributor

Is there an API_KEY that you are using to test the endpoints? I was going to run a test script for your code but I could not given I have no API_KEY for it.

@JATAYU000

Copy link
Copy Markdown
Contributor Author

Is there an API_KEY that you are using to test the endpoints? I was going to run a test script for your code but I could not given I have no API_KEY for it.

If you are talking about the invalid api_key regex match in v2 you can set this in the config

key="AD000000000000000000000000000000"

@SubhadityaMukherjee

Copy link
Copy Markdown
Contributor

Is there an API_KEY that you are using to test the endpoints? I was going to run a test script for your code but I could not given I have no API_KEY for it.

If you are talking about the invalid api_key regex match in v2 you can set this in the config

key="AD000000000000000000000000000000"

Um why are we discussing API keys here. Even if its just for testing.

@geetu040 geetu040 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sdk code look good so far, please take a look at #1575 (comment) and make changes accordingly where needed.
all tests (existing and new) should pass to make sure we are retaining the original functionality of the sdk

Comment thread openml/_api/clients/http.py
Comment thread openml/_api/resources/base/resources.py Outdated
) -> pd.DataFrame: ...

@abstractmethod
def delete(self, dataset_id: int) -> bool: ...

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can remove it from here as well
see point 5 in #1575 (comment)

Comment thread openml/datasets/functions.py Outdated

did_cache_dir = _create_cache_directory_for_id(
DATASETS_CACHE_DIR_NAME,
return api_context.backend.datasets.get(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to keep the functionality of force_refresh_cache, you can use reset_cache for http.get
see point 1 in #1575 (comment)

@SimonBlanke

Copy link
Copy Markdown
Collaborator

@JATAYU000 The recent changes in the base branch fixes the test fail in the windows test job. Please update the base branch.

Comment thread tests/test_datasets/test_dataset.py
Comment thread openml/_api/resources/dataset.py Outdated
original_data_url: str | None = None,
paper_url: str | None = None,
) -> int:
raise NotImplementedError(self._not_supported(method="edit"))

@EmanAbdelhaleem EmanAbdelhaleem Feb 4, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just use self._not_supported(method="edit")
No need for raise NotImplementedError()

Comment thread tests/test_api/test_datasets.py Outdated
class TestDatasetV1API(TestAPIBase):
def setUp(self):
super().setUp()
self.client = self._get_http_client(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this is V1, using
self.client = self.http_client will do.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since the recent change, it can be used from self.http_clients[APIVersion.V1]

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I have made the change

Copilot AI review requested due to automatic review settings May 24, 2026 03:25

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 26 changed files in this pull request and generated 3 comments.

Comment thread openml/datasets/dataset.py Outdated
Comment thread openml/_api/resources/dataset.py Outdated
Comment thread openml/_api/resources/dataset.py Outdated
Copilot AI review requested due to automatic review settings May 24, 2026 05:09

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI review requested due to automatic review settings May 28, 2026 11:43

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 26 changed files in this pull request and generated 9 comments.

Comment thread openml/_api/clients/minio.py
Comment thread openml/_api/clients/minio.py
Comment thread openml/_api/clients/minio.py
Comment thread openml/_api/clients/http.py Outdated
Comment on lines +105 to +109
try:
arff.loads(response.text)
return "body.arff"
except arff.ArffException:
pass

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@geetu040 What is your opinion on this suggestions?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this has been addressed in #1616, which should get merged first

Comment thread openml/_api/clients/http.py Outdated
Comment on lines +114 to +125
candidates = []
for p in path.iterdir():
if p.name.startswith("body.") and len(p.suffixes) == 1:
candidates.append(p)

if (path / "body.xml").exists():
return "body.xml"
if not candidates:
raise FileNotFoundError(f"No body file found in path: {path}")

return "body.txt"
if len(candidates) > 1:
raise FileNotFoundError(
f"Multiple body files found in path: {path} ({[p.name for p in candidates]})"
)
Comment thread openml/utils/_openml.py Outdated
Comment thread openml/_api/resources/dataset.py Outdated
Comment thread tests/test_api/test_datasets.py
Comment thread openml/datasets/dataset.py
@JATAYU000 JATAYU000 requested a review from geetu040 May 28, 2026 13:27

@geetu040 geetu040 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice, we are close to finish this, just left a few minor comments.

Comment thread openml/_api/resources/dataset.py Outdated
class DatasetV1API(ResourceV1API, DatasetAPI):
"""Version 1 API implementation for dataset resources."""

@openml.utils.thread_safe_if_oslo_installed

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is redundant, we already apply lock on get_dataset in datasets/functions.py

Comment thread openml/_api/resources/dataset.py Outdated
class DatasetV2API(ResourceV2API, DatasetAPI):
"""Version 2 API implementation for dataset resources."""

@openml.utils.thread_safe_if_oslo_installed

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

Comment thread openml/datasets/functions.py Outdated
return openml._backend.dataset.delete_topic(data_id, topic)


def _get_dataset_description(did_cache_dir: Path, dataset_id: int) -> dict[str, Any]:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function still uses old api calls, though it's never really used in the code base, except for the tests. I'd suggest to remove this entirely from everyhwere

Comment thread openml/utils/_openml.py


@contextlib.contextmanager
def file_lock(lock_path: str) -> Iterator[None]:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Did you have to add this because of the failing tests in CI?
  2. We already have a lock in the codebase: thread_safe_if_oslo_installed, this is different because it applies lock at file levels instead of functions
  3. Do you think this file_lock could be avoided inside functions and replaced by thread_safe_if_oslo_installed at some function level.
  4. If we are keeping it, can we align this function more with thread_safe_if_oslo_installed, which would mean giving a similar name and if possible keeping file_lock (resource/file-scoped) and thread_safe_if_oslo_installed (function/id-scoped) as separate wrappers over one shared external-lock primitive.

Comment thread tests/test_api/test_datasets.py Outdated
output = dataset_v1.get_qualities(2)

assert isinstance(output, dict)
assert len(output.keys()) == 107

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quite hardcoded, should be avoided, may be >= 100 would be my suggestion, similarly in other places for other v1/v2 and get/list_quantiles.

assert output

output = dataset_v1.feature_remove_ontology(did, fid, ontology)
assert output

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't really check if the functions actually work, can you check the output or the ontologies if they have been updated after each call.

openml.config.apikey = TestBase.admin_key
topic = f"test_topic_{str(time.time())}"
dataset_v1.add_topic(31, topic)
dataset_v1.delete_topic(31, topic)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

@pytest.mark.test_server()
def test_v2_add_delete_topic(dataset_v2):
with pytest.raises(OpenMLNotSupportedError):
dataset_v2.add_topic(2, 'test_topic_' + str(time.time()))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably update for delete as well

Copilot AI review requested due to automatic review settings June 13, 2026 02:38

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 26 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

tests/test_tasks/test_task_functions.py:1

  • This assertion assumes the dataset was downloaded as parquet from MinIO and that _parquet_url is non-None. In practice, the task download may result in ARFF-only (e.g., parquet skipped via env var, parquet missing, or fallback). To make the test robust, assert that either the parquet path exists or the ARFF data path exists (or fetch the dataset with download_data=True and assert dataset.data_file/dataset.parquet_file exist accordingly).
# License: BSD 3-Clause

Comment thread openml/utils/_openml.py
Comment on lines +458 to +473
@contextlib.contextmanager
def file_lock(lock_path: str) -> Iterator[None]:
"""Context manager that uses `oslo_concurrency.lockutils.external_lock`

The oslo-based locks are placed in the centralized cache `locks` folder
returned by ``_create_lockfiles_dir()``. A deterministic name derived from
the lock path is used to avoid collisions.
"""
with warnings.catch_warnings():
warnings.simplefilter("ignore")
from oslo_concurrency import lockutils

lock_dir = _create_lockfiles_dir()
name = hashlib.sha256(str(lock_path).encode()).hexdigest()
with lockutils.external_lock(name=name, lock_path=str(lock_dir)):
yield
Comment on lines +75 to +81
if proxy == "auto":
resolved_proxies = requests.utils.get_environ_proxies(parsed_url.geturl())
proxy = requests.utils.select_proxy(parsed_url.geturl(), resolved_proxies) # type: ignore

proxy_client = ProxyManager(proxy) if proxy else None

client = minio.Minio(endpoint=parsed_url.netloc, secure=False, http_client=proxy_client)
Comment thread openml/_api/resources/dataset.py Outdated
Comment on lines +468 to +471
def get_qualities(self, dataset_id: int) -> dict[str, float] | None:
"""Get qualities of a dataset from server.

Parameters
Comment on lines 1077 to 1085
with qualities_pickle_file.open("rb") as fh_binary:
return pickle.load(fh_binary) # type: ignore # noqa: S301
except: # noqa: E722
with qualities_file.open(encoding="utf8") as fh:
qualities_xml = fh.read()

qualities = _parse_qualities_xml(qualities_xml)
qualities = openml._backend.dataset.parse_qualities_file(
qualities_file, qualities_pickle_file
)
with qualities_pickle_file.open("wb") as fh_binary:
pickle.dump(qualities, fh_binary)

return qualities
Comment on lines 410 to +411
filename_mock = mocker.patch("openml.datasets.dataset._get_features_pickle_file")
pickle_mock = mocker.patch("openml.datasets.dataset.pickle")
pickle_mock = mocker.patch("openml._api.resources.dataset.pickle")
Comment on lines 1324 to 1328
# The parquet file on minio with ID 128 is not the iris dataset from the test server.
dataset = openml.datasets.get_dataset(128, cache_format="feather")
# Workaround
dataset._parquet_url = None
dataset.parquet_file = None
Comment on lines +1114 to +1115
arff_file: Path | None = None,
parquet_file: Path | None = None,
Copilot AI review requested due to automatic review settings June 13, 2026 03:03

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 25 changed files in this pull request and generated 7 comments.


proxy_client = ProxyManager(proxy) if proxy else None

client = minio.Minio(endpoint=parsed_url.netloc, secure=False, http_client=proxy_client)
Comment on lines +90 to +92
if destination.is_file() and destination.suffix == ".zip":
with zipfile.ZipFile(destination, "r") as zip_ref:
zip_ref.extractall(destination.parent)
Comment on lines +133 to +137
file_destination = destination / file_object.object_name.rsplit("/", 1)[1]
if (file_destination.parent / file_destination.stem).exists():
# Marker is missing but archive exists means the server archive changed
# force a refresh
shutil.rmtree(file_destination.parent / file_destination.stem)
Comment on lines +375 to 383
return openml._backend.dataset.get(
dataset_id,
)

remove_dataset_cache = True
try:
description = _get_dataset_description(did_cache_dir, dataset_id)
features_file = None
qualities_file = None

if download_features_meta_data:
features_file = _get_dataset_features_file(did_cache_dir, dataset_id)
if download_qualities:
qualities_file = _get_dataset_qualities_file(did_cache_dir, dataset_id)

parquet_file = None
skip_parquet = (
os.environ.get(openml.config.OPENML_SKIP_PARQUET_ENV_VAR, "false").casefold() == "true"
)
download_parquet = "oml:parquet_url" in description and not skip_parquet
if download_parquet and (download_data or download_all_files):
try:
parquet_file = _get_dataset_parquet(
description,
download_all_files=download_all_files,
)
except urllib3.exceptions.MaxRetryError:
parquet_file = None

arff_file = None
if parquet_file is None and download_data:
if download_parquet:
logger.warning("Failed to download parquet, fallback on ARFF.")
arff_file = _get_dataset_arff(description)

remove_dataset_cache = False
except OpenMLServerException as e:
# if there was an exception
# check if the user had access to the dataset
if e.code == NO_ACCESS_GRANTED_ERRCODE:
raise OpenMLPrivateDatasetError(e.message) from None

raise e
finally:
if remove_dataset_cache:
_remove_cache_dir_for_id(DATASETS_CACHE_DIR_NAME, did_cache_dir)

return _create_dataset_from_description(
description,
features_file,
qualities_file,
arff_file,
parquet_file,
download_data,
cache_format,
download_qualities,
download_features_meta_data,
download_all_files,
force_refresh_cache,
)
Comment on lines +469 to +471
xml = self._http.get(path, enable_cache=True, refresh_cache=force_refresh_cache).text
_ = self.download_features_file(dataset_id) # ensure the file is downloaded and cached
return self._parse_features_xml(xml)
Comment on lines +114 to +119
@unittest.mock.patch.object(requests.Session, "request", autospec=True, wraps=requests.Session.request)
@pytest.mark.test_server()
def test_list_all_few_results_available(_perform_api_call):
def test_list_all_few_results_available(mocked_request):
datasets = openml.datasets.list_datasets(size=1000, data_name="iris", data_version=1)
assert len(datasets) == 1, "only one iris dataset version 1 should be present"
assert _perform_api_call.call_count == 1, "expect just one call to get one dataset"
assert mocked_request.call_count == 1, "expect just one call to get one dataset"
Comment on lines +1668 to +1669
assert dataset_url == mock_delete.call_args.kwargs.get("url")
assert 'DELETE' == mock_delete.call_args.kwargs.get("method")
Copilot AI review requested due to automatic review settings June 13, 2026 03:12

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 25 changed files in this pull request and generated 6 comments.

Comment thread openml/utils/_openml.py
Comment on lines +458 to +473
@contextlib.contextmanager
def file_lock(lock_path: str) -> Iterator[None]:
"""Context manager that uses `oslo_concurrency.lockutils.external_lock`

The oslo-based locks are placed in the centralized cache `locks` folder
returned by ``_create_lockfiles_dir()``. A deterministic name derived from
the lock path is used to avoid collisions.
"""
with warnings.catch_warnings():
warnings.simplefilter("ignore")
from oslo_concurrency import lockutils

lock_dir = _create_lockfiles_dir()
name = hashlib.sha256(str(lock_path).encode()).hexdigest()
with lockutils.external_lock(name=name, lock_path=str(lock_dir)):
yield

proxy_client = ProxyManager(proxy) if proxy else None

client = minio.Minio(endpoint=parsed_url.netloc, secure=False, http_client=proxy_client)
Comment on lines +1080 to 1085
qualities = openml._backend.dataset.parse_qualities_file(
qualities_file, qualities_pickle_file
)
with qualities_pickle_file.open("wb") as fh_binary:
pickle.dump(qualities, fh_binary)

return qualities
Comment on lines +468 to +471
path = f"data/features/{dataset_id}"
xml = self._http.get(path, enable_cache=True, refresh_cache=force_refresh_cache).text
_ = self.download_features_file(dataset_id) # ensure the file is downloaded and cached
return self._parse_features_xml(xml)
Comment on lines +491 to +502
path = f"data/qualities/{dataset_id!s}"
try:
xml = self._http.get(path, enable_cache=True, refresh_cache=force_refresh_cache).text
except OpenMLServerException as e:
if e.code == 362 and str(e) == "No qualities found - None":
# quality file stays as None
logger.warning(f"No qualities found for dataset {dataset_id}")
return None

raise e
_ = self.download_qualities_file(dataset_id) # ensure the file is downloaded and cached
return self._parse_qualities_xml(xml)
Comment on lines +440 to +441
with pytest.raises(FileNotFoundError):
self._get_cache_filename(2)
Copilot AI review requested due to automatic review settings June 13, 2026 06:10

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 25 changed files in this pull request and generated 10 comments.

resolved_proxies = requests.utils.get_environ_proxies(parsed_url.geturl())
proxy = requests.utils.select_proxy(parsed_url.geturl(), resolved_proxies) # type: ignore

proxy_client = ProxyManager(proxy) if proxy else None
_, bucket, *prefixes, _ = parsed_url.path.split("/")
prefix = "/".join(prefixes)

client = minio.Minio(endpoint=parsed_url.netloc, secure=False)
Comment on lines +468 to +471
path = f"data/features/{dataset_id}"
xml = self._http.get(path, enable_cache=True, refresh_cache=force_refresh_cache).text
_ = self.download_features_file(dataset_id) # ensure the file is downloaded and cached
return self._parse_features_xml(xml)
Comment on lines +491 to +502
path = f"data/qualities/{dataset_id!s}"
try:
xml = self._http.get(path, enable_cache=True, refresh_cache=force_refresh_cache).text
except OpenMLServerException as e:
if e.code == 362 and str(e) == "No qualities found - None":
# quality file stays as None
logger.warning(f"No qualities found for dataset {dataset_id}")
return None

raise e
_ = self.download_qualities_file(dataset_id) # ensure the file is downloaded and cached
return self._parse_qualities_xml(xml)
Comment on lines 410 to +411
filename_mock = mocker.patch("openml.datasets.dataset._get_features_pickle_file")
pickle_mock = mocker.patch("openml.datasets.dataset.pickle")
pickle_mock = mocker.patch("openml._api.resources.dataset.pickle")
Comment on lines +124 to +143
@pytest.mark.test_server()
def test_v1_edit(dataset_v1):
did = 2
result = dataset_v1.fork(did)
_wait_for_dataset_being_processed(dataset_v1, result,'active')

edited_did = dataset_v1.edit(result, description="Forked dataset", default_target_attribute="shape")
assert result == edited_did
n_tries = 10
# we need to wait for the edit to be reflected on the server
for i in range(n_tries):
edited_dataset = dataset_v1.get(result, force_refresh_cache=True)
try:
assert edited_dataset.default_target_attribute == "shape", edited_dataset
assert edited_dataset.description == "Forked dataset", edited_dataset
break
except AssertionError as e:
if i == n_tries - 1:
raise e
time.sleep(10)
Comment on lines +133 to +137
file_destination = destination / file_object.object_name.rsplit("/", 1)[1]
if (file_destination.parent / file_destination.stem).exists():
# Marker is missing but archive exists means the server archive changed
# force a refresh
shutil.rmtree(file_destination.parent / file_destination.stem)
Comment on lines +1327 to +1328
feather_file = os.path.join(cache_dir,"data","v1","download","128","iris.arff", "body.feather")
pickle_file = os.path.join(cache_dir,"data","v1","download","128","iris.arff", "body.feather.attributes.pkl.py3")
Comment on lines +1651 to 1652
@mock.patch.object(requests.Session, "request")
def test_delete_dataset_not_owned(mock_delete, test_files_directory, test_server_v1, test_apikey_v1):
Comment on lines +1668 to 1670
assert dataset_url == mock_delete.call_args.kwargs.get("url")
assert 'DELETE' == mock_delete.call_args.kwargs.get("method")
assert test_apikey_v1 == mock_delete.call_args.kwargs.get("params", {}).get("api_key")
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[ENH] V1 → V2 API Migration - datasets

10 participants